1 Outline

In this project, I used first level administrative subdivision movement data to produce a national-level gravity model for internal migration flows in Eswatini. Then, I attempted to use this model to predict movement flows between urban areas for my selected lower-level administrative subdivision, Mkhiweni. I also produced a number of animations to visualize agent movement. In this deliverable, I summarize these procedures and comment on how this analysis could be extended and improved.

2 Working with Movement Data

WorldPop has curated a variety of data for many low and middle income countries (LMICs). For this project, I used district-level movement data for Eswatini as the basis for my gravity model. This data was produced by applying an unconstrained gravity type spatial interaction model to World Health Organization data for sub-national movement from 2010 to 2015 (so, movement is at a five year temporal resolution for this data set).

After obtaining the data, I spent several days just exploring and visualizing the data. In total, flows in the data sum out to \(\approx\) 15330, i.e., WorldPop projects that 15330 people moved from one district to another in this five year period. Next, I produced an origin-destination matrix (OD matrix) to summarize movement between districts as observed in the WorldPop data.

\[\begin{bmatrix} 0&1318&1983&308 \\ 1371&0&1714&685 \\ 1916&1587&0&909 \\ 566&1205&1761&0 \\ \end{bmatrix}\]

There are a few things to note here. First, the coding scheme for the districts (which I will stick to throughout this write-up) is: 1 - Hhohho; 2 - Lubombo; 3 - Manzini; 4 - Shiselweni. In this matrix (and in all matrices in this deliverable), rows indicate the origin node and columns indicate the destination node. For example, in the OD matrix above, the cell corresponding to row 3 and column 4 shows the flows from Manzini to Shiselweni (about 909 people). Notice also that there are zeroes along the diagonal as we are interested only in the flows between districts. I have included a plot showing the coding scheme below, along with the centroid locations for each district.

In addition to an OD matrix, I created a matrix that summarized the distances in meters between all district centroids, which is generally the standard procedure for calculating distance between regions when using a gravity model (more on this later).

\[\begin{bmatrix} 0&72325&52127&109197 \\ 72325&0&60823&71172 \\ 52127&60823&0&60652 \\ 109197&71172&60652&0 \\ \end{bmatrix}\]

Notice that this matrix is symmetric while the OD matrix is not. In the OD matrix, movement from district \(i\) to district \(j\) is not necessarily equal to movement from district \(j\) to district \(i\). However, distance between two districts doesn’t change based on which node you start from, hence the symmetric nature of the distance matrix.

Finally, I produced a plot visualizing the information contained in the OD matrix. The crosses represent the quantity of in-flows (where that node is the destination) and the filled-in circles represent the quantity of out-flows (where that node is the origin).

2.1 Animating Movement

To make these visualizations of movement a little more visually interesting, I used the gganimate package to animate movement flows as observed in the data. First, I produced a simple animation with only one agent moving from the centroid of Hhhohho to the centroid of Lubombo.

Then, I worked on adding more agents and varying the movement speeds. Having variable movement speeds wasn’t really necessary, but it was a good exercise in working with gganimate.

Next, I had agents move to and from all centroids. This required quite a bit more data wrangling and spatial computation than the previous two animations. While certainly more visually interesting than the previous two animations, the animation is a little stiff because movement is only between the district centroids.

Finally, I increased the number of agents moving from district to district. Unfortunately, I couldn’t figure out how to do this in an entirely robust manner, so I ended having to hard-code quite a few things. Still, as a proof of concept, this is a pretty neat animation! To produce this animation, I used ggplot2::geom_jitter() to displace the agents and give them each their own movement path from district to district. Then, I scaled the number of agents proportionally to match the magnitude of observed flows from the WorldPop data set (this is animation is at \(\frac{1}{25}\) scale). To make the hard-coding a little easier, I focused on only the out-flows from Hhhohho.

3 Getting Started with Gravity Models

As a first exercise in working with gravity models, I followed along with a short vignette on basic gravity model estimation procedures (you can find this exercise here). This vignette helped me get acquainted with the most basic form of the gravity model, which is generally presented as: \[T_{ij} = \frac{P^{\alpha}_{i}P^{\beta}_{j}}{D_{ij}}\] This can be broken down as follows:

This is a really neat way of modeling migration. Basically, we apply the equation for gravity between two masses to migration flows, noticing that there are certain push or pull factors (\(P_i\) and \(P_j\)), as well as a certain “friction” factors that might hinder movement between two locations (\(D_{ij}\)). This basic model can be extended to include more push or pull factors, then being considered a gravity-type spatial interaction model (GTSIM).

In the vignette, we used commuter movement data between London boroughs. Notice that this movement is at a much finer temporal granularity than the WorldPop data for Eswatini. The key difference is the London data contains short-term movement decisions (based on work-related commutes), whereas the Eswatini data represents much longer-term movement decisions. Differences in the data aside, the vignette covered the basics of applying gravity models: how to wrangle the data into the appropriate form, how to produce an OD matrix, and different ways to find key parameters. It showed that, while it is possible to achieve reasonably good results when just picking the model parameters using intuition (\(\alpha\), \(\beta\), \(\gamma\)), we can improve the accuracy by estimating the parameters using the data and leveraging a Poisson regression model. From the less refined to the more refined approach, we saw an improvement in the R2 from 0.503 to 0.673.

With these practical results in minds, I moved next towards the comprehensive paper Modeling internal migration flows in sub-Saharan Africa using census microdata from Garcia et al. (see the article here). In this paper, the authors explore the utility of GTSIMs for predicting domestic migration flows in Sub-Saharan Africa, exploring the potential for cross-country and cross-temporal applicability. They included a suite of push/pull factors in their analysis, including urban population, whether or not the locations were contiguous, and the proportion of the population that was male. They found that, while the GTSIM performed very well for certain countries (i.e., Senegal and Ghana), it performed consistently poorly in others (i.e., Malawi). Likewise, in their exploration of cross-country and cross-temporal application, they found some possibility of strong prediction in these varied settings, but only for those countries whose movement was well predicted with their own GTSIM. This implies that GTSIMs do perform well generally, but there are likely certain intrinsic elements that cause GTSIMs to systematically perform poorly on certain countries.

4 Producing a Gravity Model for Eswatini

At this point, I had everything that I needed to produce a gravity model for Eswatini at the district level. Before estimating the model, I first brought in a few additional variables as potential push/pull factors: nighttime lights and the population for each district. For both of these, I used WorldPop raster data and extracted/aggregated to each district. The final data set (with all twelve possible origin/destination combinations) is included below (“ntl” is used as an abbreviation for nighttime lights).

origin_district destination_district migration_flows distance origin_ntl destination_ntl origin_pop destination_pop
1 2 1318.9845 72325.07 239002.14 170428.13 310727.1 221120.8
1 3 1983.1981 52126.63 239002.14 255677.87 310727.1 366785.5
1 4 308.5253 109197.00 239002.14 79052.48 310727.1 190248.0
2 1 1371.8446 72325.07 170428.13 239002.14 221120.8 310727.1
2 3 1714.3924 60823.06 170428.13 255677.87 221120.8 366785.5
2 4 685.9483 71172.28 170428.13 79052.48 221120.8 190248.0
3 1 1916.4960 52126.63 255677.87 239002.14 366785.5 310727.1
3 2 1587.8183 60823.06 255677.87 170428.13 366785.5 221120.8
3 4 909.7113 60652.39 255677.87 79052.48 366785.5 190248.0
4 1 566.1081 109197.00 79052.48 239002.14 190248.0 310727.1
4 2 1205.9658 71172.28 79052.48 170428.13 190248.0 221120.8
4 3 1761.7632 60652.39 79052.48 255677.87 190248.0 366785.5

This data was then used as input into a Poisson Pseudo Maximum Likelihood (PPML) gravity model, using the implementation in the gravity library. I also tried a Double Demeaning (DDM) model, but the results were not as easily interpretable because of the transformations applied during the estimation procedure. Below, I’ve included a matrix of the model residuals for the PPML model to show that the fit is reasonably good (comparison with the OD matrix from Section 2 might be needed to properly interpret these results).

\[\begin{bmatrix} 0&151&-32&-6 \\ -24&0&46&-96 \\ -215&112&0&39 \\ -50&-75&150&0 \\ \end{bmatrix}\]

For comparison, I’ve also included a matrix of residuals for a PPML model using distance as the sole regressor.

\[\begin{bmatrix} 0&246&-20&-181 \\ 299&0&222&-421 \\ -86&95&0&-591 \\ 77&99&261&0 \\ \end{bmatrix}\]

Certainly, a quick glance at these two matrices suggests that the PPML model with additional regressors (nighttime lights and population) performs better with this data. To further explore the differences in performance, I calculated the root mean squared error (RMSE) for each model: for the model using only distance, a RMSE of \(\approx\) 267.5 was achieved; in the model with additional regressors, a RMSE of \(\approx\) 102.8 was achieved. So, the additional regressors seem to add an extra level of sensitivity to the model’s prediction, leading to improved estimates. At this point, we should be a little worried about skewed results in our PPML model given that there aren’t many data points to estimate the model on. That is, we may have overfit to our relatively few data points here which would hinder the generalizability of our model. Additionally, distance is the strongest predictor amongst our pool of variables (all but distance have very small coefficient estimates), which may also cause issues. Still, this gravity model seems to be a pretty good fit at the district level.

5 Attempting to Predict Movement Flows in Mkhiweni

Before applying the district level model to Mkhiweni (my selected second-level administrative unit), I produced a Voronoi tessellation using the centroids of the twelve de facto urban areas in Mkhiweni (these are from Project 1). This acts as a clean, but somewhat artificial, partition of the space within Mkhiweni. I have included an additional plot as a reminder of where Mkhiweni is positioned within Eswatini.

Notice that the Voronoi polygons don’t perfectly partition the urban areas: these polygons are based on the centroids for each urban area. At this point, I made the decision to use the Voronoi polygons as the geometries to extract and aggregate nighttime lights and population to (as opposed to the urban polygons themselves). This was probably not the correct decision to make here, but I wanted to use the Voronoi polygons in a meaningful way. Also, nighttime lights and population aren’t very strong predictors, so the difference here in terms of prediction accuracy is likely negligible. So, I collected all of the necessary variables and used the fitted PPML model to predict flows between urban areas in Mkhiweni. An excerpt of the predicted OD matrix for the twelve urban areas is included below.

\[\begin{bmatrix} 0&44639&22517&65366&18038& \dots \\ 44594&0&196333&75407&26914& \dots \\ 22526&196612&0&32986&30036& \dots \\ 65480&75613&33029&0&12595& \dots \\ 17997&26881&29956&12545&0& \dots \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots \end{bmatrix}\]

It’s worth mentioning that the coding for the nodes is not the same as the original, district-level node coding–these are the urban areas within Mkhiweni, not the districts of Eswatini. In total there are 144 cells in this matrix, but only 25 cells are included above. We should immediately be suspicious of these results: many of the flows are on the order of 104, but only around 30000 people live in Mkhiweni. These flows do not seem reasonable for five year movement in Mkhiweni. It seems that the estimated model parameters don’t translate well over vast differences in spatial granularity, which is not that surprising. Certainly, in any form of regression analysis, one should be quite cautious when attempting to make predictions using values far outside the range observed in the data set used for estimation. Additionally, it seems that our model is too sensitive to differences in distance: when we attempt to extrapolate to much smaller distances, the reduced “friction” in the model cannot be properly balanced by our other regressors, so we see wildly inaccurate estimates. In the end, it is interesting to see these results, but it is also important to be aware of the limitations in the methodology.

5.1 Improving the Model

The district-level gravity model seemed to fit relatively well, indicating that there is potential utility in using a GTSIM for movement prediction in Eswatini. To refine these predictions, there are a number of things we could do. Collecting a stronger group of predictors would help pin down the regressors most important in Eswatini; Garcia et al. find strength in quite a few predictors that I didn’t manage to include in my analysis (for instance, whether or not two locations are contiguous would be easy to add and would likely greatly enhance model accuracy). Additionally, since Garcia et al. found that GTSIMs can be applied across countries relatively well, it may be worth experimenting with using another country’s (perhaps a neighboring country like South Africa, Lesotho, or Mozambique) model with Eswatini. In this analysis, we have seen that the model is not robust enough to traverse vast levels of spatial granularity while still producing reasonable estimates. While moving all the way down to movement in Mkhiweni (a second-level administrative unit) didn’t work well, perhaps applying the model to intra-district movement would work better.

To improve the temporal and spatial granularity of the model, we would likely need better data. This is a constant struggle in data science applications to development: we always need better data, but often data collection is infeasible due to monetary issues or other miscellaneous factors. So, we can produce a reasonably good gravity model for five-year movement at the district-level for Eswatini, but that seems to be about as far as we can go with this data.

While I didn’t produce additional animations to pair with the predicted movement, this would be relatively easy to do (or could be easy if I didn’t hard-code so much). In a more complete analysis, animations for predicted movement would be a nice accompaniment to the model results. In the same vein, integrating elements of the synthetic population produced in Project 2 could allow for interesting simulations that might give us insight into shifting demographic characteristics across Eswatini as agents make long-term movement decisions. Unfortunately, I didn’t quite get to the point where I could have brought in the results from Project 2, but it’s still interesting to think about what could have been accomplished if I had had a little more time!

6 The Code

Below, I’ve linked to the different scripts that I produced for this project. A disclaimer: these scripts are pretty messy, so don’t expect too much from them!